Goto

Collaborating Authors

 similarity query


where the last inequality follows from the fact that Uij 1. Also, for any i [n ] and j [k], we have xi bµj

Neural Information Processing Systems

To prove Lemma 2 we start by proving a few inequalities. Since Ais an ( 1, 2,Q)-solver, using Definition 4 and Taylor's expansion, we get for any i [n] and j [k], In this section we present and prove a few auxiliary results which will be used in the proofs our main results. We start with the following standard concentration inequalities. R2, (32) if n clog(1/δ)2, where c > 0 is some absolute constant. The following locality lemma states that the fuzzy k-means function is strictly increasing. Lemma 5. Let (X,P?) be a clustering instance, where P? refers to the optimal solution for the fuzzy k-mean problem (namely, minimizes the objective in (2)). Output: bµj 1: Initialize S φ. 2: for s= 1,2,...,mdo 3: Sample iuniformly at random from [n] and update S S {i}. Next, we analyze the performance of Algorithm 6, which estimates the center of a given cluster using a set of randomly sampled elements. Note that this algorithm is used as a sub-routine in Algorithm 1. Lemma 6 (Estimate of mean using uniform sampling). Let (X,P) be a consistent center-based clustering instance, and let δ (0,1).


Fuzzy Clustering with Similarity Queries

Neural Information Processing Systems

The fuzzy or soft k-means objective is a popular generalization of the well-known kmeans problem, extending the clustering capability of the k-means to datasets that are uncertain, vague and otherwise hard to cluster. In this paper, we propose a semisupervised active clustering framework, where the learner is allowed to interact with an oracle (domain expert), asking for the similarity between a certain set of chosen items. We study the query and computational complexities of clustering in this framework. We prove that having a few of such similarity queries enables one to get a polynomial-time approximation algorithm to an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask O(poly(k)logn)similarity queries and run with polynomialtime-complexity, where nis the number of items. The fuzzy k-means objective is nonconvex, with k-means as a special case, and is equivalent to some other generic nonconvex problem such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or alternating-minimization algorithms) can get stuck at a local minima. Our results show that by making few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms over real-world datasets, showing their effectiveness in real-world applications.



Fuzzy Clustering with Similarity Queries

Neural Information Processing Systems

The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem, extending the clustering capability of the $k$-means to datasets that are uncertain, vague and otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework, where the learner is allowed to interact with an oracle (domain expert), asking for the similarity between a certain set of chosen items. We study the query and computational complexities of clustering in this framework. We prove that having a few of such similarity queries enables one to get a polynomial-time approximation algorithm to an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask $O(\mathsf{poly}(k)\log n)$ similarity queries and run with polynomial-time-complexity, where $n$ is the number of items. The fuzzy $k$-means objective is nonconvex, with $k$-means as a special case, and is equivalent to some other generic nonconvex problem such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or alternating-minimization algorithms) can get stuck at a local minima. Our results show that by making few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms over real-world datasets, showing their effectiveness in real-world applications.



Fuzzy Clustering with Similarity Queries

Neural Information Processing Systems

The fuzzy or soft k -means objective is a popular generalization of the well-known k -means problem, extending the clustering capability of the k -means to datasets that are uncertain, vague and otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework, where the learner is allowed to interact with an oracle (domain expert), asking for the similarity between a certain set of chosen items. We study the query and computational complexities of clustering in this framework. We prove that having a few of such similarity queries enables one to get a polynomial-time approximation algorithm to an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask O(\mathsf{poly}(k)\log n) similarity queries and run with polynomial-time-complexity, where n is the number of items.


Leveraging Chemistry Foundation Models to Facilitate Structure Focused Retrieval Augmented Generation in Multi-Agent Workflows for Catalyst and Materials Design

arXiv.org Artificial Intelligence

Molecular property prediction and generative design via deep learning models has been the subject of intense research given its potential to accelerate development of new, high-performance materials. More recently, these workflows have been significantly augmented with the advent of large language models (LLMs) and systems of LLM-driven agents capable of utilizing pre-trained models to make predictions in the context of more complex research tasks. While effective, there is still room for substantial improvement within the agentic systems on the retrieval of salient information for material design tasks. Moreover, alternative uses of predictive deep learning models, such as leveraging their latent representations to facilitate cross-modal retrieval augmented generation within agentic systems to enable task-specific materials design, has remained unexplored. Herein, we demonstrate that large, pre-trained chemistry foundation models can serve as a basis for enabling semantic chemistry information retrieval for both small-molecules, complex polymeric materials, and reactions. Additionally, we show the use of chemistry foundation models in conjunction with image models such as OpenCLIP facilitate unprecedented queries and information retrieval across multiple characterization data domains. Finally, we demonstrate the integration of these systems within multi-agent systems to facilitate structure and topological-based natural language queries and information retrieval for complex research tasks.


Fuzzy Clustering with Similarity Queries

arXiv.org Machine Learning

The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem, extending the clustering capability of the $k$-means to datasets that are uncertain, vague, and otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework, where the learner is allowed to interact with an oracle (domain expert), asking for the similarity between a certain set of chosen items. We study the query and computational complexities of clustering in this framework. We prove that having a few of such similarity queries enables one to get a polynomial-time approximation algorithm to an otherwise conjecturally NP-hard problem. In particular, we provide probabilistic algorithms for fuzzy clustering in this setting that asks $O(\mathsf{poly}(k)\log n)$ similarity queries and run with polynomial-time-complexity, where $n$ is the number of items. The fuzzy $k$-means objective is nonconvex, with $k$-means as a special case, and is equivalent to some other generic nonconvex problem such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or, expectation-maximization algorithm) can get stuck at a local minima. Our results show that by making few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms over real-world datasets, showing their effectiveness in real-world applications.